February 27, 2019
Combining distant and close reading remains an unfulfilled promise: software often inhibits combining both perspectives. How can workflows for coding and annotating textual data be implemented? The polmineR R package (and a few companion packages) is presented as a potential solution.
Special focus: Interactive graph annotation as an approach to generate intersubjectively shared interpretations/understandings of discourse patterns.
The “triple”:
polmineR: basic vocabulary for corpus analysis
RcppCWB: wrapper for the Corpus Workbench (using C++/Rcpp; a follow-up to the rcqp package)
cwbtools: tools to create and manage CWB indexed corpora
Beyond the triple:
performance: if analysis is slow, interaction with the data suffers
portability: painless installation on all major platforms
open source: no restrictive or inhibiting licenses
usability: make full use of the RStudio IDE
documentation: transparency of the methods implemented
theory is code: combine quantitative and qualitative methods
install.packages("polmineR")
drat::addRepo("polmine") # add CRAN-style repository to known repos
install.packages("GermaParl") # the downloaded package includes a small sample dataset
GermaParl::germaparl_download_corpus() # get the full corpus
library(polmineR)
use("GermaParl") # activate the corpora in the GermaParl package, i.e. GERMAPARL
create subcorpora: partition(), subset()
counting: hits(), count(), dispersion() (cf. size())
create term-document-matrices: as.TermDocumentMatrix()
get keywords / feature extraction: features()
compute cooccurrences: cooccurrences(), Cooccurrences()
inspect concordances: kwic()
recover full text: get_token_stream(), html(), read()
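A toy illustration of what a term-document matrix holds, using base R only (the two mini "documents" below are invented, not corpus data):

```r
# A term-document matrix in miniature: rows are terms, columns are
# documents, cells hold term frequencies. Toy documents, not corpus data.
docs <- list(
  doc1 = c("migration", "integration", "politik"),
  doc2 = c("politik", "debatte", "migration", "migration")
)
tdm <- table(
  term = unlist(docs, use.names = FALSE),
  doc  = rep(names(docs), lengths(docs))
)
tdm["migration", ]  # term frequency of "migration" per document
```

as.TermDocumentMatrix() yields the analogous (sparse) structure for corpora and partitions.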
p <- partition("GERMAPARL", year = 2001)
m <- partition("GERMAPARL", speaker = "Merkel", regex = TRUE)
am <- corpus("GERMAPARL") %>% subset(speaker == "Angela Merkel")
m <- corpus("GERMAPARL") %>% subset(grep("Merkel", speaker)) # beware: also matches Petra Merkel!
cdu_csu <- corpus("GERMAPARL") %>%
subset(party %in% c("CDU", "CSU")) %>%
subset(role != "presidency")
dt <- dispersion("GERMAPARL", query = "Flüchtlinge", s_attribute = "year")
barplot(height = dt$count, names.arg = dt$year, las = 2, ylab = "Frequency")
q <- '[pos = "NN"] "mit" "Migrationshintergrund"'
corpus("GERMAPARL") %>% kwic(query = q, cqp = TRUE, left = 10, right = 10)
# assumption: `good` / `bad` are character vectors of positive / negative
# vocabulary, and SentiWS a data.frame with columns "word" and "weight",
# derived from the SentiWS sentiment dictionary
kwic("GERMAPARL", query = "Islam", positivelist = c(good, bad)) %>%
  highlight(lightgreen = good, orange = bad) %>%
  tooltips(setNames(SentiWS[["word"]], SentiWS[["weight"]])) %>%
  knit_print()
partition("GERMAPARL", date = "2009-11-10", speaker = "Merkel", regex = TRUE) %>%
html(height = "250px") %>%
highlight(list(yellow = c("Bundestag", "Regierung")))
p <- partition("BE", date = "2005-04-28", speaker = "Körting", regex = TRUE, verbose = FALSE)
sc <- corpus("BE") %>% subset(date == "2005-04-28") %>% subset(grepl("Körting", speaker))
ek <- as.speeches(p, s_attribute_name = "speaker", verbose = FALSE)[[4]]
h <- get_highlight_list(BE_lda, partition_obj = ek, no_token = 150)
lapply(h, function(x) x[1:8])
html(p, height = "350px") %>% highlight(highlight = h)
Prepared in the MigTex Project (“Textressourcen für die Migrations- und Integrationsforschung”, funding: BMFSFJ)
Preparation of all plenary debates in Germany’s regional parliaments (2000-2018) using the “Framework for Parsing Plenary Protocols” (frappp package)
Extraction of a thematic subcorpus using unsupervised learning (topic modelling)
Size of the MigParl corpus: 27,241,205 tokens
Size without interjections and presidency: 22,837,376 tokens (about 84 %)
structural annotation: id | speaker | party | role | lp | session | date | regional_state | interjection | year | agenda_item | agenda_item_type | speech | topics | harmonized_topics


The cooccurrences() method can be applied to subcorpora/partitions as well as to whole corpora:
cooccurrences("GERMAPARL", query = "Islam", left = 10, right = 10)
Starting with polmineR v0.7.9.11, the package includes a method to efficiently calculate all cooccurrences in a corpus.
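What Cooccurrences() computes can be sketched in miniature with base R: for every occurrence of a node word, the tokens within the left/right window are counted as cooccurrences. The token stream below is a made-up toy, not GermaParl data:

```r
tokens <- c("der", "Islam", "gehört", "zu", "Deutschland", ",",
            "sagte", "er", ",", "der", "Islam", "und", "die", "Muslime")
node <- "Islam"
window <- 5L

positions <- which(tokens == node)
# indices of all tokens within +/- window around each node occurrence
context <- unlist(lapply(positions, function(i) {
  setdiff(max(1L, i - window):min(length(tokens), i + window), i)
}))
sort(table(tokens[context]), decreasing = TRUE)  # window counts per word
```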
m <- partition("GERMAPARL", year = 2008, speaker = "Angela Merkel", interjection = FALSE)
drop <- terms(m, p_attribute = "word") %>% noise() %>% unlist()
Cooccurrences(m, p_attribute = "word", left = 5L, right = 5L, stoplist = drop) %>%
  decode() %>%
  ll() %>%
  subset(ll >= 10.83) %>%
  subset(ab_count >= 5) -> coocs
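The threshold ll >= 10.83 used above is the critical value of the chi-squared distribution with one degree of freedom at p = 0.001. A minimal base-R sketch of a log-likelihood (G2) statistic of this kind, with purely illustrative toy counts (not GermaParl figures):

```r
# Log-likelihood (G2) for one node/collocate pair, computed from the
# 2x2 contingency table of observed vs expected counts.
# Toy figures -- not GermaParl results:
k <- 12      # joint count within the window (ab_count)
a <- 300     # total count of the node word
b <- 250     # total count of the collocate
n <- 100000  # corpus size in tokens

g2 <- function(k, a, b, n) {
  o <- c(k, a - k, b - k, n - a - b + k)                          # observed
  e <- c(a * b, a * (n - b), (n - a) * b, (n - a) * (n - b)) / n  # expected
  2 * sum(ifelse(o > 0, o * log(o / e), 0))
}

g2(k, a, b, n)             # well above the threshold for these toy counts
qchisq(p = 0.999, df = 1)  # critical value, approx. 10.83
```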
Dependence of graph layout on filter decisions
Difficulty of justifying filter decisions
Possibility of providing extra information, but peril of information overload
Overwhelming complexity of large graphs
How to handle the complexity and create the foundations for intersubjectivity?
Politeness of the AfD
No self-isolation? Interaction with other parties (and visitors!)
Cultivating antagonism: “we” (the AfD / the AfD parliamentary group) versus the others
It’s the economy: introducing a redistributive logic as a leitmotiv
“visual hermeneutics” (G. Schaal): graph annotation as an approach to realising the ideal of combining distant and close reading